Pluralistic Image Completion
Most image completion methods produce only one result for each masked input,
although there may be many reasonable possibilities. In this paper, we present
an approach for pluralistic image completion -- the task of generating
multiple and diverse plausible solutions for image completion. A major
challenge faced by learning-based approaches is that there is usually only one
ground truth training instance per label. As such, sampling from conditional VAEs
still leads to minimal diversity. To overcome this, we propose a novel and
probabilistically principled framework with two parallel paths. One is a
reconstructive path that uses the single given ground truth to obtain a prior
distribution over the missing parts and to reconstruct the original image from
that distribution. The other is a generative path whose conditional prior is
coupled to the distribution obtained in the reconstructive path. Both are
supported by GANs. We also introduce a new short+long term attention layer that
exploits distant relations among decoder and encoder features, improving
appearance consistency. When tested on datasets of buildings (Paris), faces
(CelebA-HQ), and natural images (ImageNet), our method not only generates
higher-quality completion results, but also produces multiple and diverse
plausible outputs.
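As a rough illustration of the attention idea (not the paper's implementation), the PyTorch sketch below combines self-attention over decoder features ("short term") with cross-attention to encoder features ("long term"); the layer names, shapes, and the assumption that encoder and decoder features share channel and spatial dimensions are all illustrative.

```python
# Hedged sketch: combined short-term (self) and long-term (cross) attention.
# Assumes dec_feat and enc_feat are both (B, C, H, W) with the same C, H, W.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortLongTermAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def attend(self, q_feat, kv_feat):
        b, c, h, w = q_feat.shape
        q = self.query(q_feat).flatten(2).transpose(1, 2)      # (B, HW, C')
        k = self.key(kv_feat).flatten(2)                        # (B, C', HW)
        v = self.value(kv_feat).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = F.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        return (attn @ v).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, dec_feat, enc_feat):
        short = self.attend(dec_feat, dec_feat)  # relations within decoder features
        long = self.attend(dec_feat, enc_feat)   # relations to encoder features
        return dec_feat + self.gamma * (short + long)
```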
T2Net: Synthetic-to-Realistic Translation for Solving Single-Image Depth Estimation Tasks
Current methods for single-image depth estimation use training datasets with
real image-depth pairs or stereo pairs, which are not easy to acquire. We
propose a framework, trained on synthetic image-depth pairs and unpaired real
images, that comprises an image translation network for enhancing realism of
input images, followed by a depth prediction network. A key idea is having the
first network act as a wide-spectrum input translator, taking in either
synthetic or real images, and ideally producing minimally modified realistic
images. This is done via a reconstruction loss when the training input is real,
and a GAN loss when it is synthetic, removing the need for heuristic
self-regularization. The second network is trained on a task loss for synthetic
image-depth pairs, with extra GAN loss to unify real and synthetic feature
distributions. Importantly, the framework can be trained end-to-end, leading to
good results, even surpassing early deep-learning methods that use real paired
data.
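The loss routing described above can be pictured with a small sketch; here `T` (translation network), `D` (depth network), and `disc` (discriminator on translated images) are hypothetical placeholders, and the specific losses are illustrative assumptions rather than the paper's exact settings.

```python
# Hedged sketch of per-batch loss selection for the two-network framework.
import torch
import torch.nn.functional as F

def training_losses(x, depth_gt, is_synthetic, T, D, disc):
    translated = T(x)
    losses = {}
    if is_synthetic:
        # GAN loss pushes translated synthetic images toward the real domain
        logits = disc(translated)
        losses["gan"] = F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))
        # task loss: supervised depth regression, available only for synthetic pairs
        losses["task"] = F.l1_loss(D(translated), depth_gt)
    else:
        # reconstruction loss keeps real inputs minimally modified
        losses["recon"] = F.l1_loss(translated, x)
    return losses
```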
IPO-LDM: Depth-aided 360-degree Indoor RGB Panorama Outpainting via Latent Diffusion Model
Generating complete 360-degree panoramas from narrow field of view images is
an ongoing research problem, as omnidirectional RGB data is not readily
available. Existing GAN-based approaches struggle to achieve high-quality
output and generalize poorly across different mask types. In this paper,
we present our 360-degree indoor RGB panorama outpainting model using latent
diffusion models (LDM), called IPO-LDM. We introduce a new bi-modal latent
diffusion structure that utilizes both RGB and depth panoramic data during
training, yet outpaints ordinary depth-free RGB images surprisingly well
during inference. We further propose a novel technique of introducing
progressive camera rotations during each diffusion denoising step, which
substantially improves panorama wraparound consistency.
Results show that our IPO-LDM not only significantly outperforms
state-of-the-art methods on RGB panorama outpainting, but can also produce
multiple and diverse well-structured results for different types of masks.
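To give a feel for how a rotation between denoising steps could be wired in, the sketch below circularly shifts a panoramic latent after each step so the wraparound seam never stays fixed at the image border; the sampler interface and shift size are assumptions, not the paper's schedule.

```python
# Hedged sketch: horizontal circular shift of an equirectangular latent
# between denoising steps, to illustrate the wraparound-consistency idea.
import torch

def denoise_with_rotation(latent, denoise_step, num_steps, shift_per_step=4):
    # latent: (B, C, H, W) latent of an equirectangular panorama
    for t in reversed(range(num_steps)):
        latent = denoise_step(latent, t)
        # rotate the panorama horizontally; the seam moves every step,
        # so no single column is always at the image border
        latent = torch.roll(latent, shifts=shift_per_step, dims=-1)
    return latent
```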
What Does Stable Diffusion Know about the 3D Scene?
Recent advances in generative models like Stable Diffusion enable the
generation of highly photo-realistic images. Our objective in this paper is to
probe the diffusion network to determine to what extent it 'understands'
different properties of the 3D scene depicted in an image. To this end, we make
the following contributions: (i) We introduce a protocol to evaluate whether a
network models a number of physical 'properties' of the 3D scene by probing for
explicit features that represent these properties. The probes are applied to
datasets of real images with annotations for the property. (ii) We apply this
protocol to properties covering scene geometry, scene material, support
relations, lighting, and view dependent measures. (iii) We find that Stable
Diffusion is good at a number of properties including scene geometry, support
relations, shadows, and depth, but performs less well for occlusion. (iv) We also
apply the probes to other models trained at large scale, including DINO and
CLIP, and find their performance inferior to that of Stable Diffusion.
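The probing protocol can be pictured as fitting a simple classifier on frozen features; the sketch below trains a linear probe for a single property, with the feature source, labels, and hyperparameters as illustrative assumptions rather than the paper's exact setup.

```python
# Hedged sketch: linear probe on frozen features for one scene property.
import torch
import torch.nn as nn

def train_linear_probe(features, labels, num_classes, epochs=100, lr=1e-3):
    # features: (N, D) frozen features extracted from the probed network
    # labels:   (N,) long tensor of property labels (e.g. "in shadow" vs "lit")
    probe = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(features), labels)
        loss.backward()
        opt.step()
    return probe
```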
MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation
Although two-stage Vector Quantized (VQ) generative models allow for
synthesizing high-fidelity and high-resolution images, their quantization
operator encodes similar patches within an image into the same index, resulting
in repeated artifacts across similar adjacent regions with existing decoder
architectures. To address this issue, we propose incorporating spatially
conditional normalization to modulate the quantized vectors, inserting
spatially variant information into the embedded index maps and encouraging the
decoder to generate more photorealistic images. Moreover, we use multichannel
quantization to increase the recombination capability of the discrete codes
without increasing the cost of the model and codebook. Additionally, to generate
discrete tokens at the second stage, we adopt a Masked Generative Image
Transformer (MaskGIT) to learn an underlying prior distribution in the
compressed latent space, which is much faster than the conventional
autoregressive model. Experiments on two benchmark datasets demonstrate that
our proposed modulated VQGAN is able to greatly improve the reconstructed image
quality as well as provide high-fidelity image generation.
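A SPADE-style modulation layer conveys the gist of spatially conditional normalization: normalize the decoder activations, then apply a per-position scale and shift predicted from the quantized code map. The sketch below is an assumed, simplified form, not the paper's exact architecture.

```python
# Hedged sketch: spatially conditional normalization driven by the code map.
# Assumes code_map has been rearranged to (B, code_channels, H, W) matching feat.
import torch
import torch.nn as nn

class SpatialModulation(nn.Module):
    def __init__(self, feat_channels, code_channels, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(code_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, code_map):
        h = self.shared(code_map)
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        # spatially varying scale/shift breaks the symmetry between positions
        # that share the same codebook index
        return self.norm(feat) * (1 + gamma) + beta
```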
Unified Discrete Diffusion for Simultaneous Vision-Language Generation
The recently developed discrete diffusion models perform extraordinarily well
in the text-to-image task, showing significant promise for handling
multi-modality signals. In this work, we harness these traits and present a
unified multimodal generation model that can conduct both the "modality
translation" and "multi-modality generation" tasks using a single model,
performing text-based, image-based, and even vision-language simultaneous
generation. Specifically, we unify the discrete diffusion process for
multimodal signals by proposing a unified transition matrix. Moreover, we
design a mutual attention module with a fused embedding layer and a unified
objective function to emphasise the inter-modal linkages, which are vital for
multi-modality generation. Extensive experiments indicate that our proposed
method can perform comparably to the state-of-the-art solutions in various
generation tasks.
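For intuition, a single-step transition matrix of an absorbing-plus-uniform discrete diffusion over a shared vocabulary (text and image tokens plus a [MASK] state) can be written as below; the schedule parameters and matrix structure are illustrative assumptions, not the paper's unified transition matrix.

```python
# Hedged sketch: one-step transition matrix Q_t for a discrete diffusion whose
# vocabulary covers both modalities' tokens plus a final [MASK] index.
import torch

def transition_matrix(vocab_size, alpha_t, gamma_t):
    # alpha_t: probability of keeping the token, gamma_t: jumping to [MASK]
    K = vocab_size
    beta_t = (1.0 - alpha_t - gamma_t) / (K - 1)  # mass spread uniformly over tokens
    Q = torch.full((K, K), beta_t)
    Q.fill_diagonal_(alpha_t + beta_t)            # keep the current token
    Q[:, -1] = gamma_t                            # transition to [MASK]
    Q[-1, :] = 0.0
    Q[-1, -1] = 1.0                               # [MASK] is absorbing
    return Q                                      # rows sum to 1
```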
Explicit Correspondence Matching for Generalizable Neural Radiance Fields
We present a new generalizable NeRF method that is able to directly
generalize to new unseen scenarios and perform novel view synthesis with as few
as two source views. The key to our approach lies in explicitly modeled
correspondence matching information, which provides a geometry prior for
predicting NeRF color and density in volume rendering. The explicit
correspondence matching is quantified with the cosine similarity between image
features sampled at the 2D projections of a 3D point on different views, which
is able to provide reliable cues about the surface geometry. Unlike previous
methods where image features are extracted independently for each view, we
consider modeling the cross-view interactions via Transformer cross-attention,
which greatly improves the feature matching quality. Our method achieves
state-of-the-art results on different evaluation settings, with the experiments
showing a strong correlation between our learned cosine feature similarity and
volume density, demonstrating the effectiveness and superiority of our proposed
method. Code and pre-trained models: https://github.com/donydchen/matchnerf
Project page: https://donydchen.github.io/matchnerf
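The cosine-similarity cue can be sketched as sampling image features at the 2D projections of a 3D point in two source views and comparing them; the projection and feature-sampling details below are simplified assumptions, not the method's full pipeline.

```python
# Hedged sketch: cosine similarity between features sampled at the projections
# of the same 3D point in two source views.
import torch
import torch.nn.functional as F

def cosine_matching_score(feat_a, feat_b, uv_a, uv_b):
    # feat_*: (1, C, H, W) feature maps of two source views
    # uv_*:   (1, N, 1, 2) projected point coordinates, normalized to [-1, 1]
    f_a = F.grid_sample(feat_a, uv_a, align_corners=True)   # (1, C, N, 1)
    f_b = F.grid_sample(feat_b, uv_b, align_corners=True)
    # high similarity suggests the 3D point lies on a surface seen by both views
    return F.cosine_similarity(f_a, f_b, dim=1).view(-1)    # (N,)
```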